Genetic insights into agronomic and morphological traits of drug-type cannabis revealed by genome-wide association studies

Cannabis sativa L., previously concealed by prohibition, is now a versatile and promising plant, thanks to recent legalization, opening doors for medical research and industry growth. However, years of prohibition have left the Cannabis research community lagging behind in understanding Cannabis genetics and trait inheritance compared to other major crops. To address this gap, we conducted a comprehensive genome-wide association study (GWAS) of nine key agronomic and morphological traits, using a panel of 176 drug-type Cannabis accessions from the Canadian legal market. Utilizing high-density genotyping-by-sequencing (HD-GBS), we successfully generated dense genotyping data in Cannabis, resulting in a catalog of 800 K genetic variants, of which 282 K common variants were retained for GWAS analysis. Through GWAS analysis, we identified 18 markers significantly associated with agronomic and morphological traits. Several identified markers exert a substantial phenotypic impact, guided us to putative candidate genes that reside in high linkage-disequilibrium (LD) with the markers. These findings lay a solid foundation for an innovative cannabis research, leveraging genetic markers to inform breeding programs aimed at meeting diverse needs in the industry.

GWAS studies in hemp-type Cannabis [31][32][33] to investigate fiber quality, flowering time and sex determination and drug-type Cannabis 34 to investigate genetic basis of terpenes have enabled identification of significant genetic markers.The newly identified QTL will enable the early selection of promising individuals through markerassisted selection (MAS) 35 , thereby reducing the labor and costs associated with development of improved varieties.Genetic association studies are, therefore, of significant value in advancing breeding programs towards molecular approaches 23 .
While flowering time and sex determination have been focal points in Cannabis breeding, the genetic basis of other important agronomic traits (e.g., yield, height, days to maturity, etc.) remain largely unexplored.Morphological traits should be duly considered due to their established intercorrelations with yield, maturity and cannabinoid profiles 36 .For instance, Cannabis plants cultivated for medicinal and recreational application exhibit shorter stature, have thinner stems, more nodes, higher floral density, and a different cannabinoid profiles compared to industrial hemp plants 37 .On the other hand, genetic backgrounds that prioritize yield may negatively impact THC production, and vice versa 36 .Investigating genetic variations associated with agronomic and morphological traits is essential for establishing the genetic groundwork for developing tailor-made Cannabis varieties, along with breeding tools such as MAS and genomic selection (GS) 38 .
To facilitate the development of molecular tools for Cannabis breeders and researchers, the present study provides high-value markers linked to essential agronomic and morphological traits, identified through GWAS conducted on 176 drug-type Cannabis accessions from the Canadian legal market.Markers associated with essential traits were identified using the multi-locus statistical method Bayesian-information and linkagedisequilibrium iteratively nested keyway (BLINK) 39 .In summary, this study lays the groundwork for a comprehensive understanding of the genetic foundations underpinning the agronomic and morphological traits in Cannabis.The markers identified through this research promise to significantly expedite breeding efforts, empowering us to cultivate Cannabis varieties optimized for various purposes and applications.

Plant material and phenotyping data
All research activities, including the procurement and cultivation of Cannabis plants, were executed in accordance with our Cannabis research license (LIC-QX0ZJC7SIP-2021) and in full compliance with Health Canada's regulations.In total, in this study, we used 176 drug-type accessions each accompanied by phenotyping data sourced from Lapierre et al. 1 .These accessions were selected from diverse sources to ensure representation of the broad spectrum of the drug-type Cannabis varieties available in the legal market of Canada (Supplementary Table S1).
In this study, we used four key productivity-related traits, including fresh biomass (FB; whole Cannabis plant excluding the roots), dried flower weight (DFW; representing yield), sexual maturity (SM; defined as the stage at which the first floral bud could be observed at the base of an axillary stem prior to the initiation of flowering) and harvest maturity (HM; days to maturity).Additionally, we included five morphological traits, namely stem diameter (SD), canopy diameter (CD), height, internode length index (ILI) and node counts (NC).It is worth noting that values were originally recorded in inches and were converted to centimeter for consistency.Histograms representing the distribution of each trait for the 176 accessions were generated using R v4.2.1 40 with the 'hist' function.Furthermore, a t-test was performed to determine whether the minimum and maximum values of each trait significantly differed from the overall population mean.

DNA isolation, library preparation and sequencing
Approximately 50 mg of young leaf tissue from each accession was collected for DNA extraction.The collected leaf tissues were air-dried for four days using a desiccating agent (Drierite; Xenia, OH, USA) and then ground with metallic beads in a RETSCH MM 400 mixer mill (Fisher Scientific, MA, USA).DNA extraction was carried out using the CTAB-chloroform protocol 41 .In brief, the powdered tissue was treated with a CTAB buffer solution, followed by a phenol-chloroform extraction procedure.The resulting DNA pellet underwent ethanol washing and was subsequently re-suspended in water.DNA quantification was carried out using a Qubit fluorometer with the dsDNA HS assay kit (Thermo Fisher Scientific, MA, USA), and concentrations were adjusted to 10 ng/μl for all samples.Final DNA samples were used to prepare HD-GBS libraries with BfaI as described in Torkamaneh et al. 29 at the Institut de biologie intégrative et des systèmes (IBIS), Université Laval, QC, Canada.Sequencing was conducted on an Illumina NovaSeq 6000 (Illumina, CA, USA) with 150 paired-end reads at the Genome Quebec Service and Expertise Center (CESGQ), Montreal, QC, Canada.

SNP calling and filtration
Sequencing data were processed with the Fast-GBS v2.0 42 using the C. sativa cs10 v2 reference genome (GenBank acc.no.GCA_900626175.2) 15 .For variant calling a prerequisite of a minimum of 6 reads to call a single nucleotide polymorphism (SNP) was opted.Raw SNP data were filtered with VCFtools v0.1.16 43to remove low-quality SNPs (QUAL < 10 and MQ < 30) and variants with proportion of missing data exceeding 80%.Missing data imputation was performed with BEAGLE 4.1 44 , followed by a second round of filtration, retaining only biallelic variants with heterozygosity less than 50% and a minor allele frequency (MAF) of > 0.06.Additionally, variants residing on unassembled scaffolds were removed.The resulting catalog of ~ 282 K SNPs was used to conducted genetic analysis, population structure assessment and GWAS (Supplementary Tables S2).

Marker description
Read counts and coverage were calculated with SAMtools "coverage" parameter 45 .Proportion of heterozygous variants and MAF were estimated using TASSEL5 46 .The proportion of SNPs located within annotated genes was determined with BEDTools 47 by analyzing the number of SNPs overlapping with gene regions 48 (Supplementary Table S3).To visualize the distribution of SNP density, a plot was produced with rMVP 22 using 'plot.type= "d"' parameter, in combination with the gene density distribution.The nucleotide diversity (π) 49 was measured in a sliding windows of 1000 bp across the genome using-window-pi option of VCFtools 43 .Similarly, the pairwise π was calculated among different clusters.

LD decay and Haplotype block
Pairwise-LD was calculated with PLINK v1.9 50 using '-r2 -ld-window-r2 0' parameters.Long-range LD, measured as the allele frequency correlation (r 2 ), was determined for all pairwise SNPs within each chromosome independently (Supplementary Table S3).The LD decay curve line was fitted on the scatterplot using the smoothing spline regression following the procedure of Remington et al. 51 in the R environment (Fig. 2b).The point of intersection between the LD curve and the predefined r 2 threshold determined the LD decay.Estimation of haplotype blocks (HBs) was performed with PLINK v1.9 using '-blocks no-pheno-req -ld-window-kb 999' .A t-test was conducted in R to assess whether the LD decay of the chromosome X significantly differed from that of other chromosomes.

Population structure analysis
Population structure and admixture Population admixture was determined using a variational Bayesian inference algorithm implemented in fastStructure v1.0 52 for a number of subpopulations (K) set from 1 to 10.The optimal number of K (i.e., 3) explaining the population complexity was estimated using the ChooseK tool from fastStructure and admixture proportions were visualized using Distruct v2.3 (Fig. 2c, Supplementary Fig. S1).The kinship matrix (K*) was generated using TASSEL5 with the Centered_IBS method and plotted with GAPIT v3 21 (Supplementary Fig. S3).

Discriminant analyses of principal components (DAPC) for population structure
Population structure was further investigated using discriminant analyses of principal components (DAPC) 53 using the R package 'adegenet' version 2.1.10.The number of cluster was estimated using 'find.cluster'function with a maximum limit set to 40 clusters and 200 principal components (PCs) (Fig. 2d).Optimal number of clusters (i.e., K = 3) was determined by the minimal Bayesian Information Criterion (BIC) value for different numbers of K (Supplemental Fig. S2ab).To visualize the DAPC using the 'scatter' function, the optimal number of PCs was estimated with two cross-validation procedures using 'optim.a.score' (i.e., PCs = 20, Supplemental Fig. S2c) and 'xvalDapc' (i.e., PCs ≤ 20, Supplemental Fig. S2d).

Comparison of population assignments and trait analysis
Cluster assignments from both fastStructure and DAPC were compared using the 'table' function for a K value of 3 and 6 (Supplementary Fig. S4).An analysis of variance (ANOVA) and permutational ANOVA (PERMANOVA) were performed for traits following and deviating from the normal distribution, respectively, using the 'adonis2' function from R package 'vegan' .Cluster assignments obtained from fastStructure were used as covariate.In cases where ANOVA/PERMANOVA indicated a significant difference, the post-hoc Tukey honestly significant difference (HSD) test was performed to determine which pairs were significantly different.Violin plots were generated with 'ggplot2' in R and Tukey significant differences were represented by letter (Supplementary Fig. S5).

Genome-wide association analysis
Marker-trait association analysis was performed using the method BLINK 39 in GAPIT v3 21 , using the 282 K highquality SNPs and the phenotyping data for nine different traits.The identification of false positive was minimized by incorporating population structure (i.e., P matrix generated with fastStructure for K = 3) and kinship (i.e., K* matrix generated with TASSEL5) for the analysis.The threshold of significance for marker-trait associations in both methods was set to ensure a false discovery rate < 0.05, adjusted with a Benjamini-Hochberg correction.Markers with a phenotypic variance explained (PVE) less than 3% were excluded from the analysis as they were considered uninformative and of limited interest.Manhattan plots showing -log 10 (p) distribution of markers by chromosome were generated with rMVP 22 using 'plot.type= "m"' and quantile-quantile (QQ) plots were created with GAPIT v3 21 (Supplementary Fig. S6).Boxplot of the allelic classes of significant markers were generated with 'ggplot2' in R (Supplementary Fig. S7).

Preliminary candidate gene identification
Due to the substantial genetic diversity present in Cannabis, only a limited number of SNPs exhibited a strong LD (r 2 ≥ 0.95).Therefore, to pinpoint genetic regions of interest, only markers in high LD (r 2 ≥ 0.75) with significant markers were retained to define haplotype blocks (HBs).Markers failing to form HB and residing outside of genetic regions were removed from the candidate gene investigation.Genes located in the HBs (defined by the 5ʹ-most and 3ʹ-most marker of the HB) were considered as putative candidate genes.The gene ontology (GO) annotations of these candidate genes were examined based on the description provided by the NCBI Cannabis sativa Annotation Release 100.To further confirm and provide a more detailed functional annotation of www.nature.com/scientificreports/candidate genes, phylogenetic ortholog inferences were performed using OrthoFinder 54 with the Arabidopsis thaliana transcriptome (TAIR 11) 55 .

A broad range of phenotypic variation among the 176 drug-type accessions
The population displayed significant phenotypic diversity (p < 0.001) across the nine examined traits (Fig. 1, Supplemental Table S1).For instance, FB exhibited a substantial variation, ranging from 90 to 1260 g, while plant height varied between 22 and 109 cm.SM also showed a significant diversity, with individuals initiating the first flower bud between 20 and 68 days.With the exception of SM, all other traits displayed a unimodal distribution, suggesting a complex genetic control involving multiple QTL.Furthermore, these traits exhibited highly skewed distributions, indicating that some accessions may carry specific alleles or combinations of alleles exerting a substantial impact on these traits.This phenotypic diversity within the Cannabis accessions provides a robust foundation for GWAS, aligning with established criteria for successful GWAS outcomes 23 .

Genetic diversity in the GWAS-panel revealed by dense genotyping
To achieve comprehensive marker coverage across the Cannabis genome, an HD-GBS approach was used.Sequencing of HD-GBS libraries generated 486 M reads, averaging 2.8 M reads per sample.This extensive sequencing effort resulted in an average per-sample coverage of 7.7% of the cs10 v2 assembly, achieving a cumulative coverage of 34.1% across the entire genome for the entire population.The analysis of variant calling from our sequencing data initially yielded a substantial dataset of 2.7 M raw variants that met the quality criteria.
Following filtering for missing data and minor allele frequency (MAF of 1%), we successfully identified ~ 800 K polymorphic variants, with an overall proportion of missing data reaching 61% before imputation step.This SNP catalog meets the criteria required to perform a relevant missing data imputation 56 .Subsequently, we performed a secondary round of filtering, primarily aimed at retaining common variants, as defined by a MAF of 6%, retaining approximately 39% of the raw data.While this filtering step may exclude rare variants that could potentially influence complex traits, it is essential to reduce the risk of false-positive associations and ensure www.nature.com/scientificreports/ that a minimum of 10 accessions carries the significant allele, thereby preventing overfitting in GWAS models 23 .
The HD-GBS approach and the filtering procedures resulted in a catalog of 282 K high-quality SNPs (all details in Supplemental Table S2).Within this catalog, 25.5% of the genotypes were found to be heterozygous and the SNPs exhibited an average MAF of 21.7%.For a detailed overview of filtering steps and the number of variants retained at each stage, refer to Supplementary Table S3.Overall, this SNP catalog represents an extensive genetic resource for the subsequent GWAS and underscores the robustness of the genotyping strategy used in this study.Markers were exceptionally well distributed across the genome, ensuring coverage of gene-rich regions.On average, there was one marker per every ~ 3 kb of the genome, which significantly enhances the likelihood of identifying markers in strong LD with putative candidate genes or regions (Fig. 2a).Across the entire physical map, only 12 gaps exceeding 1 Mb, with the largest being 1.2 Mb, were identified.Comparing our dataset with the RAD-Seq method used in the study of Petit et al. 32,33 , by employing comparable filtration criteria, the HD-GBS approach yielded a comparable number of markers while utilizing only one-tenth of the sequencing efforts (averaging 2.8 M vs. 29.7 M reads per sample).Therefore, the density and genomic distribution of SNPs provided by the HD-GBS approach make it a cost-effective option for conducting GWAS on large Cannabis panel.Furthermore, this approach is compatible with the miniaturization of sequencing libraries using the NanoGBS procedure, which further contributes to substantial cost reduction in genotyping 57 .
The average extent of LD decay to its half ranged from 22.6 to 89.0 kb across different chromosomes (Fig. 2b).It is important to note that LD decay is a relative value and does not precisely reflect to reality recombination rates throughout the entire genome, particularly between heterochromatic and euchromatic regions 58 .However, this measure proved valuable for comparing the impact of domestication and selection on recombination rates among different populations.In this context, the LD observed in the GWAS-panel showed rapid decay compared to modern cultivars of comparable genome size, such as soybean (where LD may extend over 100 kb 59 ) and tomato (where LD can extend over 1 Mb 60 ).Nevertheless, LD decayed to its half more slowly compared to a recent study of 110 domesticated and landrace Cannabis accessions from various worldwide origins, where LD decayed over approximately 10 kb 61 .This resulted in a large number of small HBs with an average size of ~ 4 kb (Supplemental Table S3).It is worth noting that the LD decay on the sex chromosome was almost twice slower (p < 0.001) compared to autosomes.These observations were consistent with the recent history of Cannabis cultivation in Canada, characterized by extensive hybridization efforts by breeders with a particular focus on sexual characteristics, such as the production of female flowers 2 .loci through multi-locus analysis 39 .Furthermore, this method has proven its effectiveness with large catalog of SNPs [77][78][79] and was ranked as the most statistically powerful method for multi-locus analyzes for GWAS in plants 21,39,80 .As the identification of high-value markers for Cannabis is in its early stages, the practical implementation of these markers by breeding programs will nevertheless require preliminary cross-validation.This can be achieved through meta-GWAS 81 , QTL mapping with biparental population and BSA.Additionally, comprehensive functional analyses of the candidate genes will be crucial. .

Investigation of putative candidate genes
Among the 18 associated SNPs, 11 were in high LD (r 2 ≥ 0.75) with other SNPs, forming HBs (Table 2).Notably, SNP_9 and _17 were part of the same HB, spanning ~ 97 kb on chromosome 1.The SNP _26 was located within LOC115699444without forming HBs.The 11 HBs spanned ~ 250 kb, within which 21 annotated genes were identified.Consequently, these genes were considered as putative candidates genes associated with different traits.Recent genome annotation of cs10 48 facilitated the investigation of the functions of candidate genes (Table 2).An orthology analysis was conducted by comparing the protein sequences of candidate genes with the Arabidopsis proteome 55 .Functional annotations were similar for the majority of candidate genes and their respective orthologs, confirming the robustness of the functional annotation of the cs10 transcriptome.
The SNP_4, which showed associations with DFW, CD, and height, was found to be in high LD with LOC115722258, associated with chloroplast metabolism and mechanisms.This suggests a potential link between the genetic variation of SNP_4 and the observed variations in these morphological traits through their impact on chloroplast-related processesIn addition to structural genes, regulatory genes, such as transcription factors, were identified among the potential candidate genes (e.g., LOC115706624).Approximately one-third of the associated SNPs were not in high LD with putative candidate gene, but they might more likely linked to gene regulatory regions.The in-silico identification of regulatory regions and their interaction with a gene is challenging and complex to link associated SNPs and the phenotype.However, this does not diminish their importance, especially for markers SNP_7 and _8, which were associated with a substantial impact on HM (PVE > 30%).www.nature.com/scientificreports/These findings suggest that regulatory elements, such as transcription factors, may play a role in shaping the phenotypic variation in cultivated Cannabis.However, confirming the relevance of these candidate genes will still require further analysis.

Conclusion
In conclusion, this study marks a pioneering exploration of the genetic landscape of Canadian drug-type Cannabis through a comprehensive GWAS analysis, enriched by high-throughput genotyping and precise agronomic phenotyping data.Our findings open new avenues for advancing Cannabis breeding programs and addressing the diverse needs of emerging industries.The application of a high-density genotyping approach yielded an extensive catalog of high-quality SNPs, effectively capturing the genomic diversity of drug-type Cannabis.
The distribution of these markers across different chromosomes, coupled with high quality phenotypic data, facilitated the identification of molecular markers associated with complex agronomic and morphological traits.These markers hold great promise for further investigations to elucidate their functional links with phenotype variations, making them valuable assets for precision breeding efforts.
As we move forward, this research paves the way for in-depth studies to uncover the biological mechanisms governing these traits, potentially uncovering hidden genetic potential within Cannabis populations.Furthermore, the implications of our work extend beyond immediate applications, as the identified markers are poised to play a pivotal role in the development of tailor-made Cannabis cultivars, spanning both medicinal and recreational sectors, capable of meeting the dynamic demands of rapidly evolving industries.
Future perspectives in this domain encompass a deeper exploration of the candidate genes associated with the identified markers, seeking to unravel the intricate genetic and molecular underpinnings of these key traits.Additionally, functional validation experiments and expression profiling could elucidate the precise mechanisms through which these markers exert their effects.Collaborative efforts between academia and industry are essential to harness this newfound genetic knowledge and translate it into practical breeding strategies, ensuring the continued innovation and sustainability of the Cannabis crop.

Figure 1 .
Figure 1.Frequency distribution of phenotypic data for 176 drug-type accessions used in this study.For each trait, the minimum and maximum values significantly differed (t-test p < 0.001) from the overall population mean.

Figure 2 .
Figure 2. Genome-wide distribution of markers, linkage disequilibrium (LD) and population structure analysis.(a) Density plot of markers and genes across the genome.Colors represent the number of SNPs within 1 Mb window size.(b) LD decay in each chromosome where LD values of intra-chromosomal pairwise markers were plotted against physical distance.(c) Admixture plot for K = 3 using fastStructure.The vertical lines represent the accessions, and the y-axis represents the probability that an individual belongs to a subgroup.(d) Discriminant analysis of principal components (DAPC) scatter plot showing population structure.

Figure 3 .
Figure 3. Genome-wide association studies (GWAS) for nine agronomic and morphological traits in drug-type cannabis.Manhattan plot for productivity-related traits (a) and morphological traits (b).Each circle indicates the degree of association for a marker with a trait (y axis), while the x axis shows the physical position of each marker on a given chromosome across the genome.The horizontal grey line indicates the significance threshold (p-value = 1.77 × 10 −7 , false discovery rate < 0.05).Marker-trait associations were performed with the Bayesianinformation and linkage-disequilibrium iteratively nested keyway (BLINK) method. )

Table 1 .
List of markers associated with nine different traits in drug-type cannabis identified through GWAS.Marker-trait associations were performed with the Bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK) method.a MSS: Most significant SNPs.b MAF: Minor allele frequency.c PVE: Phenotypic variance explained.d Effect represent the allelic effect estimate of the major allele.